### Write-Through

- On data-write hit, could just update the block in cache
  - But then cache and memory would be inconsistent
- Write through: also update memory
- But makes writes take longer
  - e.g., if base CPI = 1, 10% of instructions are stores, write to memory takes 100 cycles
    - Effective CPI = 1 + 0.1×100 = 11
- Solution: write buffer
  - Holds data waiting to be written to memory
  - CPU continues immediately
    - Only stalls on write if write buffer is already full



#### Write-Back

- Alternative: On data-write hit, just update the block in cache
  - Keep track of whether each block is dirty
- When a dirty block is replaced
  - Write it back to memory
  - Can use a write buffer to allow replacing block to be read first

#### **Write Allocation**

- What should happen on a write miss?
- Alternatives for write-through
  - Allocate on miss: fetch the block
  - Write around: don't fetch the block
    - Since programs often write a whole block before reading it (e.g., initialization)
- For write-back
  - Usually fetch the block



#### **In-Class Exercise**

- Can work in groups of two
  - Include your names on the paper!
- 1) Sketch a direct-mapped cache and assume:
  - A block size of 8 bytes
  - Contains 16 blocks (128 total bytes of data)
  - The main memory contains 4096 bytes of storage
- 2) Show how the address bits might map to the cache.
- 3) What block would byte 1786 map into?

# **Example: Intrinsity FastMATH**

- Embedded MIPS processor
  - 12-stage pipeline
  - Instruction and data access on each cycle
- Split cache: separate I-cache and D-cache
  - Each 16KB: 256 blocks × 16 words/block
  - D-cache: write-through or write-back
- SPEC2000 miss rates

I-cache: 0.4%

D-cache: 11.4%

Weighted average: 3.2%



# **Example: Intrinsity FastMATH**



#### **Main Memory Supporting Caches**

- Use DRAMs for main memory
  - Fixed width (e.g., 1 word)
  - Connected by fixed-width clocked bus
    - Bus clock is typically slower than CPU clock
- Example cache block read
  - → 1 bus cycle for address transfer
- → 15 bus cycles per DRAM access
- → 1 bus cycle per data transfer
- For 4-word block, 1-word-wide DRAM
  - Miss penalty = 1 + 4×15 + 4×1 = 65 bus cycles
  - Bandwidth = 16 bytes / 65 cycles = 0.25 B/cycle



#### **Measuring Cache Performance**

- Components of CPU time
  - Program execution cycles
    - Includes cache hit time
  - Memory stall cycles
    - Mainly from cache misses
- With simplifying assumptions:

Memory stall cycles

# Cache Performance Example

- Given
  - I-cache miss rate = 2%
  - D-cache miss rate = 4%
  - Miss penalty = 100 cycles
  - Base CPI (ideal cache) = 2
  - Load & stores are 36% of instructions
- Miss cycles per instruction
  - I-cache: 0.02 × 100 = 2
  - D-cache:  $0.36 \times 0.04 \times 100 = 1.44$
- -> Actual CPI = 2 + 2 + 1.44 = 5.44
  - Ideal CPU is 5.44/2 =2.72 times faster



#### **Average Access Time**

- Hit time is also important for performance
- Average memory access time (AMAT)
  - AMAT = Hit time + Miss rate × Miss penalty
- Example
  - CPU with 1ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I-cache miss rate = 5%
  - $\blacksquare$  AMAT = 1 + 0.05 × 20 = 2ns
    - 2 cycles per instruction

# **Performance Summary**

- When CPU performance increased
  - Miss penalty becomes more significant
- Decreasing base CPI
  - Greater proportion of time spent on memory stalls
- Increasing clock rate
  - Memory stalls account for more CPU cycles
- Can't neglect cache behavior when evaluating system performance



#### **Associative Caches**

Address:

0×08

- Fully associative
  - Allow a given block to go in any cache entry
  - Requires all entries to be searched at once
  - Comparator per entry (expensive)
- n-way set associative
  - Each set contains n entries
  - Block number determines which set <sup>0</sup>/<sub>7</sub>
    - (Block number) modulo (#Sets in cache)
  - Search all entries in a given set at once
  - n comparators (less expensive)



# **Associative Cache Example**



# **Spectrum of Associativity**

#### For a cache with 8 entries

One-way set associative (direct mapped)



Two-way set associative



Four-way set associative

| Set | Tag | Data | Tag | Data | Tag | Data | Tag | Data |
|-----|-----|------|-----|------|-----|------|-----|------|
| 0   |     |      |     |      |     |      |     |      |
| 1   |     |      |     |      |     |      |     |      |

Eight-way set associative (fully associative)

| Tag | Data |
|-----|------|-----|------|-----|------|-----|------|-----|------|-----|------|-----|------|-----|------|
|     |      |     |      |     |      |     |      |     |      |     |      |     |      |     |      |

# **Associativity Example**

- Compare 4-block caches
  - Direct mapped, 2-way set associative, fully associative
  - Block access sequence: 0, 8, 0, 6, 8

    Tydex
- Direct mapped

| Block   | Cache | Hit/miss | Cache content after access |   |        |   |  |  |  |  |
|---------|-------|----------|----------------------------|---|--------|---|--|--|--|--|
| address | index |          | 0                          | 1 | 2      | 3 |  |  |  |  |
| 0       | 0     | miss     | Mem[0]                     |   |        |   |  |  |  |  |
| 8       | 0     | miss     | Mem[8]                     |   |        |   |  |  |  |  |
| 0       | 0     | miss     | Mem[0]                     |   |        |   |  |  |  |  |
| 6       | 2     | miss     | Mem[0]                     |   | Mem[6] |   |  |  |  |  |
| 8       | 0     | miss     | Mem[8]                     |   | Mem[6] |   |  |  |  |  |

# **Associativity Example**

#### 2-way set associative

| Block   | Cache | Hit/miss | (      | Cache conter | t after access |  |  |
|---------|-------|----------|--------|--------------|----------------|--|--|
| address | index |          | Se     | et 0         | Set 1          |  |  |
| 0       | 0     | miss     | Mem[0] |              |                |  |  |
| 8       | 0     | miss     | Mem[0] | Mem[8]       |                |  |  |
| 0       | 0     | hit      | Mem[0] | Mem[8]       |                |  |  |
| 6       | 0     | miss     | Mem[0] | Mem[6]       |                |  |  |
| 8       | 0     | miss     | Mem[8] | Mem[6]       |                |  |  |

#### Fully associative

| Block   | Hit/miss | Cache content after access |        |        |  |  |  |  |  |  |
|---------|----------|----------------------------|--------|--------|--|--|--|--|--|--|
| address |          |                            |        |        |  |  |  |  |  |  |
| 0       | miss     | Mem[0]                     |        |        |  |  |  |  |  |  |
| 8       | miss     | Mem[0]                     | Mem[8] |        |  |  |  |  |  |  |
| 0       | hit      | Mem[0]                     | Mem[8] |        |  |  |  |  |  |  |
| 6       | miss     | Mem[0]                     | Mem[8] | Mem[6] |  |  |  |  |  |  |
| 8       | hit      | Mem[0]                     | Mem[8] | Mem[6] |  |  |  |  |  |  |

# **How Much Associativity**

- Increased associativity decreases miss rate
  - But with diminishing returns
- Simulation of a system with 64KB
   D-cache, 16-word blocks, SPEC2000
  - 1-way: 10.3%
  - 2-way: 8.6%
  - 4-way: 8.3%
  - 8-way: 8.1%

#### **Set Associative Cache Organization**





### Replacement Policy

- Direct mapped: no choice
- Set associative
  - Prefer non-valid entry, if there is one
  - Otherwise, choose among entries in the set
- Least-recently used (LRU)
  - Choose the one unused for the longest time
    - Simple for 2-way, manageable for 4-way, too hard beyond that
- Random
  - Gives approximately the same performance as LRU for high associativity



#### **Multilevel Caches**

- Primary cache attached to CPU
  - Small, but fast
- Level-2 cache services misses from primary cache
  - Larger, slower, but still faster than main memory
- Main memory services L-2 cache misses
- Some high-end systems include L-3 cache

